In this script we assess whether perturbation indicators are associated with any of the technical covariates. We do this via both joint and marginal analyses. In the joint analysis, we run a logistic regression of the perturbation indicators on all technical covariates. We extract \(p\)-values for each individual covariate, as well as the \(p\)-value for the chi-squared test for the group of all covariates. In the marginal analysis, we run logistic regressions of the perturbation indicators on the technical covariates one by one. All analyses are done on the set of cells receiving a non-targeting gRNA, to match the set of cells used in our undercover analysis.

Results

Discussion

Most of the technical covariates appear to have no effect on the propensity of perturbation, with the exception of bio_rep and lane for the Papalexi data and batch for the Schraivogel enhancer screen datasets. Let’s take a closer look at these.

Biological replicate in the Papalexi dataset

Let’s cross-tabulate biological replicate with NTC in the Papalexi data:

df <- lowmoi::get_data_for_pert_prop("papalexi", "eccite_screen") |>
  dplyr::mutate(assigned_grna = factor(assigned_grna, levels = paste0("NTg", c(1:5,7:10))))
  
df |> 
  dplyr::select(assigned_grna, bio_rep) |> table()
##              bio_rep
## assigned_grna rep_1 rep_2 rep_3
##         NTg1    135   111    57
##         NTg2     53    27    26
##         NTg3     67    65   112
##         NTg4    259   100    72
##         NTg5    107   105    93
##         NTg7    177   157   111
##         NTg8      5    23    28
##         NTg9     96    77    88
##         NTg10    66    90    79

It does look like different NTCs are overrepresented in different biological replicates (e.g. NTg3 is overrepresented in rep_3 while NTg4 is overrepresented in rep_1). Let’s verify this using a chi-squared test of independence:

chisq.test(df$assigned_grna, df$bio_rep)
## 
##  Pearson's Chi-squared test
## 
## data:  df$assigned_grna and df$bio_rep
## X-squared = 178.75, df = 16, p-value < 2.2e-16

So there is indeed a very significant association between biological replicate and assigned gRNA.

Lane in the Papalexi dataset

Let’s cross-tabulate lane with NTC in the Papalexi data:

df |> 
  dplyr::select(assigned_grna, lane) |> table()
##              lane
## assigned_grna Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7 Lane8
##         NTg1     26    31    43    35    39    45    41    43
##         NTg2     15    12    12    14    12    11    14    16
##         NTg3     17    15    17    18    40    48    50    39
##         NTg4     66    54    70    69    30    44    45    53
##         NTg5     22    33    27    25    38    44    54    62
##         NTg7     36    46    38    57    47    83    76    62
##         NTg8      2     2     1     0    10    15    16    10
##         NTg9     30    19    22    25    29    48    36    52
##         NTg10    17    11    20    18    39    44    40    46

One thing that sticks out is that NTg8 is found in way fewer cells than the other NTCs. Let’s look more into this:

df |> dplyr::count(assigned_grna)
##   assigned_grna   n
## 1          NTg1 303
## 2          NTg2 106
## 3          NTg3 244
## 4          NTg4 431
## 5          NTg5 305
## 6          NTg7 445
## 7          NTg8  56
## 8          NTg9 261
## 9         NTg10 235

Indeed, NTg8 is found in only 56 cells, while the other NTCs are usually found in hundreds of cells. Looking at these counts, NTg2 looks a little low as well. Let’s return to the question of association between lane and assigned gRNA. We conduct a chi-squared test:

chisq.test(df$assigned_grna, df$lane)
## 
##  Pearson's Chi-squared test
## 
## data:  df$assigned_grna and df$lane
## X-squared = 175.26, df = 56, p-value = 3.264e-14

We see that the \(p\)-value is quite significant. Let’s exclude the two NTCs with small number of cells to see whether the effect persists:

df_smaller <- df |> dplyr::filter(!(assigned_grna %in% c("NTg2","NTg8")))
chisq.test(df_smaller$assigned_grna, df_smaller$lane)
## 
##  Pearson's Chi-squared test
## 
## data:  df_smaller$assigned_grna and df_smaller$lane
## X-squared = 142.77, df = 42, p-value = 6.653e-13

Indeed, the effect persists.

Batch in the Schraivogel enhancer screen (chromosome 8)

Let’s cross-tabulate batch with NTC in the Schraivogel enhancer screen data for chromosome 8:

df <- lowmoi::get_data_for_pert_prop("schraivogel", "enhancer_screen_chr8") |>
    dplyr::mutate(batch = factor(batch, levels = paste0("sample", 1:14)))
  
df |> 
  dplyr::select(assigned_grna, batch) |> 
  table()
##                      batch
## assigned_grna         sample1 sample2 sample3 sample4 sample5 sample6 sample7
##   non-targeting-00000       0       0       0       0       0       0       0
##   non-targeting-00001       0       0       0       0       0       0       0
##   non-targeting-00002       0       0       0       0       0       0       0
##   non-targeting-00003       0       0       0       0       0       0       0
##   non-targeting-00004       0       0       0       0       0       0       0
##   non-targeting-00005       0       0       0       0       0       0       0
##   non-targeting-00006       0       0       0       0       0       0       0
##   non-targeting-00007       0       0       0       0       0       0       0
##   non-targeting-00008       0       1       0       0       0       0       0
##   non-targeting-00009       0       0       0       0       0       0       1
##   non-targeting-00010       0       0       0       0       0       0       0
##   non-targeting-00011       0       0       0       0       0       0       0
##   non-targeting-00012       0       0       0       0       0       0       0
##   non-targeting-00013       0       0       0       0       0       0       0
##   non-targeting-00014       0       0       0       0       0       0       0
##   non-targeting-00015       0       0       0       0       0       0       0
##   non-targeting-00016       0       0       0       0       0       0       0
##   non-targeting-00017       0       0       0       0       0       0       0
##   non-targeting-00018       0       1       0       0       0       0       0
##   non-targeting-00019       3       0       0       0       0       0       0
##   non-targeting-00020       0       0       0       0       0       0       0
##   non-targeting-00021       0       0       0       0       1       0       0
##   non-targeting-00022       0       1       1       0       0       0       0
##   non-targeting-00023       0       0       0       0       0       1       0
##   non-targeting-00024       0       0       0       0       0       0       0
##   non-targeting-00025       0       0       0       0       0       1       0
##   non-targeting-00026       0       0       0       0       0       1       0
##   non-targeting-00027       0       0       0       0       0       0       0
##   non-targeting-00028       0       0       0       1       0       0       0
##   non-targeting-00029       0       1       0       0       1       0       1
##                      batch
## assigned_grna         sample8 sample9 sample10 sample11 sample12 sample13
##   non-targeting-00000       0       0        0        0        0        0
##   non-targeting-00001       0       0        1        0        0        0
##   non-targeting-00002       0       0        0        0        0        0
##   non-targeting-00003       0       0        0        0        0        0
##   non-targeting-00004       0       0        0        0        0        0
##   non-targeting-00005       0       0        0        0        0        0
##   non-targeting-00006       0       0        1        0        0        0
##   non-targeting-00007       0       0        0        0        1        1
##   non-targeting-00008       0       0        0        0        0        0
##   non-targeting-00009       0       0        0        0        0        0
##   non-targeting-00010       0       0        0        0        0        0
##   non-targeting-00011       0       0        0        0        0        0
##   non-targeting-00012       0       0        1        0        0        0
##   non-targeting-00013       0       0        0        0        0        0
##   non-targeting-00014       0       0        0        0        0        0
##   non-targeting-00015       0       0        0        0        0        0
##   non-targeting-00016       0       0        1        0        0        0
##   non-targeting-00017       1       0        0        1        0        1
##   non-targeting-00018       0       0        0        0        0        0
##   non-targeting-00019       0       0        0        0        1        0
##   non-targeting-00020       0       0        0        0        0        0
##   non-targeting-00021       0       0        0        0        0        0
##   non-targeting-00022       0       0        0        0        0        0
##   non-targeting-00023       0       0        0        0        0        0
##   non-targeting-00024       0       0        0        0        0        0
##   non-targeting-00025       0       1        0        0        0        0
##   non-targeting-00026       0       0        0        0        0        0
##   non-targeting-00027       0       0        0        0        0        0
##   non-targeting-00028       0       0        0        0        0        0
##   non-targeting-00029       0       0        0        0        0        1
##                      batch
## assigned_grna         sample14
##   non-targeting-00000       58
##   non-targeting-00001       72
##   non-targeting-00002       67
##   non-targeting-00003       71
##   non-targeting-00004       44
##   non-targeting-00005       58
##   non-targeting-00006       43
##   non-targeting-00007       52
##   non-targeting-00008       26
##   non-targeting-00009       37
##   non-targeting-00010       28
##   non-targeting-00011       45
##   non-targeting-00012       81
##   non-targeting-00013       33
##   non-targeting-00014       56
##   non-targeting-00015       60
##   non-targeting-00016       79
##   non-targeting-00017       56
##   non-targeting-00018       99
##   non-targeting-00019       32
##   non-targeting-00020       42
##   non-targeting-00021      107
##   non-targeting-00022       54
##   non-targeting-00023       30
##   non-targeting-00024       28
##   non-targeting-00025      141
##   non-targeting-00026       40
##   non-targeting-00027       55
##   non-targeting-00028      120
##   non-targeting-00029       64

One thing that jumps out at us is that the vast majority of NTC-containing cells were sequenced in batch 14. Let’s confirm this:

df |> count(batch)
##       batch    n
## 1   sample1    3
## 2   sample2    4
## 3   sample3    1
## 4   sample4    1
## 5   sample5    2
## 6   sample6    3
## 7   sample7    2
## 8   sample8    1
## 9   sample9    1
## 10 sample10    4
## 11 sample11    1
## 12 sample12    2
## 13 sample13    3
## 14 sample14 1778

Let’s also apply a chi-squared test, calibrated based on permutation:

chisq.test(df$assigned_grna, df$batch, simulate.p.value = TRUE, B = 20000)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 20000
##  replicates)
## 
## data:  df$assigned_grna and df$batch
## X-squared = 476.86, df = NA, p-value = 0.0145

We see that the \(p\)-value is significant, but much less so that in the Papalexi data. However, note that batch may still have a very strong confounding effect when we test targeting gRNAs against NTCs, since in the population of all cells (rather than the population of just NTCs), there is a very strong association between batch and gRNA.

Batch in the Schraivogel enhancer screen (chromosome 11)

In the chromosome 11, again batch 14 dominates:

df <- lowmoi::get_data_for_pert_prop("schraivogel", "enhancer_screen_chr11") |>
      dplyr::mutate(batch = factor(batch, levels = paste0("sample", 1:14)))
df |> count(batch)
##       batch    n
## 1   sample1    6
## 2   sample2    3
## 3   sample3    3
## 4   sample4    6
## 5   sample5    3
## 6   sample6    8
## 7   sample7    2
## 8   sample8    2
## 9   sample9    6
## 10 sample10    8
## 11 sample11    2
## 12 sample13    4
## 13 sample14 3026

However, the chi-square test does not come out significant:

chisq.test(df$assigned_grna, df$batch, simulate.p.value = TRUE, B = 20000)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 20000
##  replicates)
## 
## data:  df$assigned_grna and df$batch
## X-squared = 369.05, df = NA, p-value = 0.2502

It seems that the dominant effect here is not so much that different NTCs are overrepresented in different batches, but that all NTCs are overrepresented in batch 14.